Discovery of genes positively modulating treatment effect using potential outcome framework and Bayesian update

Background The recent explosion of cancer genomics provides extensive information about mutations and gene expression changes in cancer. However, most of the identified gene mutations are not clinically utilized. It remains uncertain whether the presence of a certain genetic alteration will affect treatment response. Conventional statistics have limitations for causal inferences and are hard to gain sufficient power in genomic datasets. Here, we developed and evaluated a C-search algorithm for searching the causal genes that maximize the effect of the treatment. Methods The algorithm was developed based on the potential outcome framework and Bayesian posterior update. The precision of the algorithm was validated using a simulation dataset. The algorithm was implemented to a cBioPortal dataset. The genes discovered by the algorithm were externally validated within CancerSCAN screening data from Samsung Medical Center. Results Simulation data analysis showed that the C-search algorithm was able to identify nine causal genes out of ten. The C-search algorithm shows the discovery rate rapidly increasing until the 1500 data instances. Meanwhile, the log-rank test shows a slower increase in performance. The C-search algorithm was able to suggest nine causal genes from the cBioPortal Metabric dataset. Treating the patients with the causal genes is associated with better survival outcome in both the cBioPortal dataset and the CancerSCAN dataset which is used for external validation. Conclusions Our C-search algorithm demonstrated better performance to identify causal effects of the genes than multiple log-rank test analysis especially within a limited number of data. The result suggests that the C-search can discover the causal genes from various genetic datasets, where the number of samples is limited compared to the number of variables. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01852-3.


Introduction
Identifying genomic sequences and analyzing data is a major focus in cancer studies [1]. An understanding of the causal relationship between therapeutic effect and genomic variances among tumors will allow individualized treatment and reduce unnecessary treatment.
There are open-access, open data sources, such as the cBioPortal for Cancer Genomics. Although a large amount of data is readily accessible to researchers, most of the identified gene mutations are not clinically utilized [2].
Conventional statistical analysis of genetic data consists of a series of single-statistic tests. The cumulative probability of false positives increases as the number of genes increases. To deal with multiple-testing problems, the false discovery rate (FDR) is used, which is "expected type I errors among the total number of rejected null hypotheses" [3]. Despite the approach, the dimension of the data significantly affects the statistical power of the test. In addition, conventional statistics only draw an association; therefore, distinguishing causal relationships from spurious associations is a challenge [4].
To draw causal effects from observational data, Rubin introduced a potential outcome framework [5,6] Individual levels of treatment effect are derived from a comparison of two potential outcomes. however, observing the exposed and unexposed outcomes at the same time is impossible. One of the methods to overcome this fundamental problem of causal inference is to compute the potential outcome from samples matched with similar covariate profiles [4,7,8]. However, due to the curse of dimensionality in cancer genomics, enough sample size may not be available to match the exposed and unexposed within a genetic subset [9].
We developed an algorithm called C-search that can estimate potential outcomes using the similarityweighted Monte Carlo method. We adopted Bayesian posterior update, which allows us to estimate the uncertainty of our decision boundary from small datasets without losing power [10]. This system was used to identify the causal genes that maximize the effect of the treatment. In this study, we compared the performance of causal gene discovery between the conventional statistical method and our C-search using a simulation dataset and an open-source gene dataset.

Pseudo-counterfactual assumption and similarity weighted Monte Carlo
Assume that individual i with variable X i is treated with the treatment variable T i . T i is a binary variable ( T i = 1 , if treated; T i = 0 , if not). There are two potential outcomes: Y (X i , T i = 1) , and Y (X i , T i = 0) . The causal effect of the treatment can be drawn from the comparison between both [6].
Here, we define f (X i ) as the outcome of an individual i with variable X i when treated, and g(X i ) as the oucome of the individual when not treated. Then, individual treatment effect ( ITE ) of the individual i , ITE(X i ) can be written as follows: However, we can observe only one potential outcome at most [4,8]. Therefore, we should infer the counterfactuals from an untreated data pool. We call them pseudocounterfactuals because they are not identical to the factual.
Draw an individual j with variable X j from the untreated data pool. Weight function W X i , X j is defined as the probability of similarity sim X i , X j between the factual and pseudo-counterfactual [11]. Using the similarity weighted Monte Carlo method [12], we could estimate ITE(X i ) as follows:

Measurement of the difference in survival outcome using Win probability (Pw i )
Survival outcome Y includes survival time and survival events. Measuring individual differences in survival outcomes is difficult because they are right-censored. One of the most well-established outcome measures for survival difference is the bi-partite ranking system, such as the Wilcoxon − Mann − Whitney statistics [13]. Adopting this concept, we assumed the comparison in outcome between two individuals i and j as a Bernoulli trial. If i lives longer than j , i will win a score. ITE for the survival outcome can be defined as the win probability ( Pw i : the chance that the treated individual i wins over its untreated counterpart), which follows a binomial distribution.
The beta distribution is a conjugate prior for the binomial distribution. If we consider the comparison between two individuals i and j as a simple Bernoulli trial, the posterior distribution after observing the score s (or observing s times of winning of i ) after N j trials can be defined as follows:

Update Win probability using similarity weighted Monte Carlo
In ideal settings where all individuals j j ∈ 1, 2, . . . N j in the counterpart group that are identical to the individual i , the outcome of j is a good estimator for the counterfactual outcome of i. However, an identical condition is impossible in the observation setting. We use the similarity weighted Monte Carlo method to update Pw i . We matched individual i with the individuals in the counterpart data pool and updated the score(s ) with the similarity weight W i, j calculated from the similarity sim i, j between i and j. The sim i, j can be the Euclidean distance in the original data space [14]. Other methods use a transformed one-dimensional score, that is, a regression function, such as a propensity function [15].
Here, we defined a basal function-a regression function that approximates the survival state. The covariates of individual i and matched controls j are projected onto the space through the basal functions P i (Y |X i ) and P j (Y |X j ). To incorporate the difference onto the similarity weight, we used the Boltzmann probability distribution: k is a constant and τ is the annealing temperature. These hyperparameters represent degrees of freedom.
Let s j be the score from a single comparison between i and j . The posterior distribution after observing a single comparison between i and j can be written as follows: The Pw i is estimated from the posterior distribution after observing N j the number of counterpart individuals.
We defined the observed clinical covariates as V i . Genetic alteration is represented by simple binary values (e.g., for each genetic profile g 1 , g 2 , g 3 , . . . , g N g , existing alteration is given a value of 1; if not, it is given a value of 0). The individual i has the genetic variable G i that consists of a set of genetic profile.
To estimate Pw i , we may use the similarity weight calculated from the basal function using clinical covariates and/or genetic covariates. We denote the weight of the basal function of clinical covariates as W V i, j and the weight using the basal function of the genetic covariates as W G i, j . Using Eq. 3, W V i, j and W G i, j can be written as follows: The win probabilities of individual i using both weights are as follows:

Causal gene suggestion
To find a single gene ( g k ) effect on the treatment effect, we assumed that each gene has an independent win probability P("win ′′ |g k ).
We used the similarity weighted Monte Carlo method to estimate P "win ′′ |X i , g k = 1 or individual i 's win probability Pw i g k = 1 (Eq. 7). Observing individual i 's win probability Pw i g k = 1 updates the prior distribution of P "win ′′ = Beta(α 0 , β 0 ) and the posterior is as follows: Sampling N k -number of individuals with g k = 1 , the posterior can be as follows: Posterior Pw i ∼Beta In reality, Pw i g k = 1 is not stationary because Pw i depends on individual variables V i , G i . To identify the marginal treatment effect, we need to balance all confounding variables using inverse propensity score weighting (IPW) as follows.
For the treatment decision, we need to know whether treating a patient with genetic alteration g k is beneficial with sufficient evidence. We can estimate the posterior distribution of Pw without g k in the same manner. If g k has a significant benefit for the treatment, the upper bound of the 95% confidence interval of P("win ′′ |g k = 0 ) should be lower than the lower bound of P("win ′′ |g k = 1).

Simulation data analysis
The C-search algorithm was validated with the simulation data (Additional file 1), where the 10 causal genes that positively modulated the treatment outcome were hidden among 300 genes. We compared the precision of the C-search algorithm with that of the conventional log-rank survival analysis. Both the C-search algorithm and conventional statistics suggested 10 possible causal genes. The precision of the algorithm was determined by the number of true causal genes among the suggested genes. The performance of the algorithm was evaluated using five-fold cross-validation. To balance the covariate profile, we used propensity score matching. To perform the multiple comparison tests, we set the FDR to 0.05 and controlled it with the Benjamini − Hochberg procedure [16,17].
The power of log-rank survival analysis depends on the number of data [18]. To demonstrate whether the algorithms are dependent on the number of data, 100, 200, 500, 1000, 1500, 2000, 2500, 3000, 3500, and 4000 simulation data were used for the analysis. As the number of data increases, both algorithms show better causal gene discovery. The C-search algorithm continuously improved its discovery performance until 1500 samples were obtained and then plateaued. Conventional statistics showed a linear improvement according to the number of samples. It required at least 4000 samples to show performance comparable to that of the C-search algorithm (Fig. 1).

Finding positive modulators from open-source data
We also analyzed the cBioPortal Metabric breast cancer data using the C-search algorithm to identify causal genes that are associated with improved outcomes of chemotherapy. The dataset includes gene mutation profiles of 173 genes and clinical data from 2433 patients with primary breast cancer [19]. Clinical data consisted of age, chemotherapy, radiation therapy, sex, survival lifetime, and survival events. Among the 2433 records, 964 records that have missing on clinical data were omitted. A total of 1,469 patients were included in the analysis (Additional file 1).
The C-search suggested nine positive modulators: PRKCZ, CLK3, CDKN2A, BRAF, KRAS, CASP8, JAK1, PRKACG, and SIK2. We allocated all patients who had any of them to the causal gene group and those who had not to the other gene group. If they are the true causal genes, treated patients should show a statistically significantly better prognosis compared to the untreated patients in the causal gene group. In addition, among the Fig. 1 The number of causal genes discovered by C-search and conventional statistics. The X-axis is the number of samples consisting of the simulation data. The Y-axis is the number of true causal genes among the 10 suggested causal genes that the algorithm discovered.
The C-search algorithm shows the discovery rate rapidly increasing until the 1500 data instances. The log-rank test shows a slower increase in performance patients who were treated, the causal gene group should show a better prognosis than the other gene group. There were no overall survival differences between the causal gene group and the other gene group (Fig. 2a). Among patients with causal genes, the treatment group showed a better prognosis than the untreated group (Fig. 2b). Meanwhile, in the other gene group, there were no statistically significant differences between the treated and untreated groups (Fig. 2c). Among the treated patients, the causal gene group showed significantly better survival than the other gene group (Fig. 2d). In the untreated group, the other gene group showed better survival outcomes compared with the causal gene group (Fig. 2e). We define the optimal policy as treating patients with causal genes and not treating patients without causal genes. The other policy is to treat patients in the other gene group and not to treat patients in the causal gene group. The Kaplan − Meier survival curve of the optimal policy showed better survival outcomes than the other policies (Fig. 2f ).

Conventional log-rank analysis of open-source data
To discover causal genes using conventional statistics from the cBioPortal dataset, we performed log-rank survival analysis. As the dataset includes gene mutation profiles of 173 genes, multiple log-rank survival analyses were applied for each mutation profile. For each gene mutation, the patients treated were divided into two groups: one group consisted of patients who harbored the mutation and one group of patients without the mutation. Survival outcomes were compared between the two groups. Propensity score matching was used to balance the covariate profile. We set the FDR to 0.01 and controlled it using the Benjamini − Hochberg procedure [16,17].
A total of 10 genes were shown to be correlated with positive treatment outcomes: EGFR, CLK3, PTEN, CDH1, GATA3, KRAS, RB1, PRKACG, NEK1, and NRAS. Patients who have any of these are allocated to the causal gene group, and those who do not are assigned to other gene group. Kaplan − Meier survival curves comparing the overall survival differences between both groups are depicted in Fig. 3a. No survival differences were observed between patients with causal genes and without causal genes. In both the causal gene group and the other gene group, treatment was not associated with positive outcomes compared with no treatment (Fig. 3b, c). Meanwhile, in the treated group, the causal gene group showed better survival than the other gene group (Fig. 3d). Among the untreated patients, the causal gene group and the other gene group had no statistically significant differences in survival (Fig. 3e). Following optimal policy demonstrated better survival than following the other policy (Fig. 3f ).

Comparison between C-search and conventional log-rank analysis
We compared the Kaplan − Meier survival curve of the following optimal policy, which was determined by the genes that each algorithm discovered. The survival outcome following the C-search policy showed statistically significant better survival outcomes (Fig. 4).

Validation with external data
To determine whether the genes found by the algorithm will act as causal genes in the other dataset, we used CancerSCAN screening data from Samsung Medical Center. CancerSCAN is a custom panel developed by the Samsung Genomic Institute [20]. The usage of the data was approved by the institutional review boards of the participating institutions (Samsung Medical Center 2019-11-127).
CancerSCAN data consists of 559 breast cancer samples obtained at the Samsung Medical Center from January 2014 to September 2016. Mutation profiles of 81 genes and clinical data on age, chemotherapy, radiation therapy, sex, survival time, and survival events were included (Additional file 1). Among the causal genes suggested by C-search, mutation profiles of BRAF, KRAS, CDKN2A, and JAK1 were found in the CancerSCAN dataset. We assigned patients to the C-search causal gene group who acquired at least one of the mutations. The CDH1, EGFR, KRAS, PTEN, and RB1 are genes suggested by log-rank analysis whose mutation profiles exist in the CancerSCAN dataset. Patients with these mutations are assigned to the conventional statistics causal gene group. Figure 5a shows the significant survival difference between the treated and the untreated in the C-search causal gene group. The optimal policy suggested by the C-search showed a significantly better survival outcome (Fig. 5b). Meanwhile, among conventional statistical causal gene group, the treated did not demonstrate statistically better survival compared to the untreated (Fig. 5c). The survival outcomes following the optimal policy suggested by conventional statistics and those of the other policies are not statistically different (Fig. 5d).

Discussion
Biotechnological breakthroughs in gene profiling have led to an increased focus on individualized precision therapy [21]. Clinicians want to treat patients who will benefit the most from the therapy while avoiding treatment that will not benefit or even get harm from therapy. There are studies on genetic assays to predict prognosis or response to treatment for breast cancer [22]. BluePrint molecular subtyping profile uses 80 genes to determine the sensitivity to adjuvant treatment [23]. Prosigna Breast Cancer Prognostic Gene Signature Assay utilizes the PAM50 test, which identifies gene signatures specific to breast cancer subtypes (luminal A/B, HRE2, basal-like) [24]. Both studies used genomic profiles to infer the molecular subtypes of cancer and drew a correlation between the subtypes and the prognosis or response to treatment. However, since the correlation does not infer causation, we cannot conclude that the specific genomic profile results in a positive response to treatment. In contrast, our C-search algorithm uses a potential outcome framework to study the causal effect of each gene on treatment.
The estimation of causal effects from observational studies can be done in a number of ways. Methods based on propensity scores match, stratify, and/or inversely weight covariates that affect treatment allocation [25,26]. G-computation implements regression models [27]. Mendelian randomization uses germline gene mutations as instruments to make causal inference [28,29]. These methods are used to estimate average treatment effect of the target population. To estimate and compare the treatment effect of the patient who have a specific genetic mutation, one must calculate conditional causal effect, which is the average treatment effect of a subgroup of patients with that mutation [30]. However, due to the curse of dimensionality in cancer genomics, some subgroups may lack sufficient sample size to draw meaningful conclusions [31]. In addition, when the algorithm handles a smaller dataset, the result drawn from the inference has a considerable amount of uncertainty. It is important to incorporate uncertainty into the prediction of the algorithm [32,33]. The C-search algorithm updates the gene's win probability with a Bayesian update, thus, reflecting its uncertainty in the analysis. Pseudocode for the algorithm and the computational complexity are shown in Additional file 1.
We demonstrated that the C-search algorithm can identify causal genes from a simulation dataset that includes hidden confounders. Compared to conventional log-rank analysis, C-search requires fewer data to gain sufficient power to find the causal genes. Therefore, the C-search may find more candidate causal genes than conventional association studies using genomic data. When there are not enough data subsets, it is important as we used Bayesian update to estimate the gene's win probability.
Both C-search and log-rank analysis successfully found positive modulators in the Metabric dataset (Figs. 2, 3), yet the gene set found by C-search showed better results than log-rank analysis (Fig. 4). Among the positive regulators that C-search and conventional log-rank analysis found, two genes are found in common in both algorithms: KRAS and PRKACG . The KRAS gene targets several miR-NAs to enhance chemotherapy in acute myeloid leukemia, lung cancer, breast cancer, and gallbladder cancer [34][35][36]. PRKACG is a gene that encodes the protein kinase A subunit Cγ, whose role in cancer has not been elucidated. The causal genes suggested by both algorithms contained relatively few overlapping genes. This may be due to intercorrelated genes, at least to some extent; therefore, one of the co-expressed genes may be selected for the set as a predictor and yield comparable results [37].
Our study has several limitations. There were only a few clinical variables available in the open-source dataset; therefore, hidden confounders may affect the performance of the algorithm. IPW may result in bias in this setting [15,38]. However, the algorithm performance in the simulation data, including 10 hidden confounders, showed better results than log-rank analysis. As there are no golden standard datasets to evaluate the performance of the algorithm to determine the treatment modulating effect of genetic variables, we can only evaluate the algorithm's performance indirectly. Using the CancerSCAN dataset, we were able to externally validate that the causal genes suggested by C-search showed comparable results in the external dataset.

Conclusion
We proposed an algorithm that uses a potential outcome framework and Bayesian updating, inferring the causal effect of the genetic variable on treatment outcomes. The proposed algorithm was shown to find causal genes from the simulation data in a relatively small number of samples compared to the log-rank analysis. It also showed its performance in finding positive treatment modulators from the open-source breast cancer dataset, which is validated with external data. The C-search algorithm may be applied to various types of datasets where the number of samples is limited compared to the number of variables.
Additional file 1. How to simulate data generation, demographics of cBioPortal dataset, demographics of CancerSCAN dataset, pseudocode of the C-search Algorithm, computational complexity